feat: SIMD rendering pipeline + VSA 16384 migration + rasterizer intrinsics #112
Conversation
Document CodecSource, Provenance fields, Mode variants, PhaseDescriptor fields, and OCR SIMD/felt types. No logic changes. https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
…384-bit
P0 alignment with the canonical Binary16K format used in
lance-graph-contract::crystal::fingerprint. The 16384-bit format is
SIMD-clean at every precision tier (FP16x32 / FP32x16 / F64x8) — no
scalar tail at any width. Fixes the SIMD-alignment-sin documented in
lance-graph EPIPHANIES.md 2026-04-24.
Constants migrated:
vsa.rs VSA_DIMS 10_000 → 16_384
VSA_WORDS 157 → 256
VSA_BYTES 1250 → 2048
TAIL_BITS 16 → 64 (full word, no padding)
TAIL_MASK 0xFFFF → u64::MAX
arrow_bridge.rs SOAKING_DIMS 10000 → 16_384
SIGMA_MASK_BYTES 1250 → 2048
DEFAULT_SOAKING_DIM 10000 → 16_384
deepnsm.rs nsm_to_fingerprint -> [u8; 1250] → [u8; 2048]
XOR loop: 19 SIMD chunks + 34 scalar tail
→ 32 SIMD chunks (no tail, fully aligned)
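The no-tail property the migration buys can be sketched in plain Rust, using the commit's constants (the real XOR loop dispatches through crate::simd; this scalar version only illustrates the even division):

```rust
// Constants from this commit: 16_384 bits = 256 u64 words.
const VSA_DIMS: usize = 16_384;
const VSA_WORDS: usize = VSA_DIMS / 64; // 256

/// XOR-bind two fingerprints. With 256 words, any power-of-two chunk
/// width up to 8 words (512 bits) divides evenly — zero scalar tail.
fn xor_bind(a: &[u64; VSA_WORDS], b: &[u64; VSA_WORDS]) -> [u64; VSA_WORDS] {
    let mut out = [0u64; VSA_WORDS];
    for i in 0..VSA_WORDS {
        out[i] = a[i] ^ b[i];
    }
    out
}
```

With VSA_WORDS = 256, a 512-bit SIMD chunk covers 8 words, so 256 / 8 = 32 chunks exactly — which is what removes the old 34-byte scalar tail.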
Tests updated:
vsa.rs::test_constants — assert new values
arrow_bridge.rs::schema_constants — assert new values
arrow_bridge.rs sigma_mask len assertions — 1250 → 2048, 10000 → 16384
Test results: 1619 lib tests pass, 0 failed (full suite).
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
The hardware-acceleration mothership for q2 cockpit / Palantir Gotham.
Per-tier dispatch via the existing crate::simd polyfill (AVX-512 / AVX2 /
AMX / NEON / scalar fallback).
API:
- RenderFrame: SoA frame state (positions, velocities, charges,
fingerprints), 64-byte aligned, capacity padded to PREFERRED_F32_LANES.
- Renderer: double-buffer with atomic front/back swap (AtomicUsize XOR).
read_front() for REST/SSE consumers; write_back() for shader cycle.
- tick(dt, damping): SIMD-FMA velocity integration on back buffer
(`v.mul_add(dt_v, p)` per chunk), then atomic swap.
- GLOBAL_RENDERER: process-global LazyLock<Renderer> (4096 nodes).
- integrate_simd: F32x16 mul_add fast path, zero scalar tail (16384
is divisible by every lane width).
- apply_uniform_force: per-axis acceleration via FMA.
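The atomic XOR front/back swap can be sketched as follows. This is a shape sketch only: the generic buffer type and method names are illustrative, and it elides the reader/writer exclusion the real Renderer needs so a writer never mutates a buffer a reader still holds.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Minimal double-buffer sketch: `front` holds 0 or 1, and XOR-ing
/// with 1 flips which buffer readers see.
struct DoubleBuffer<T> {
    buffers: [T; 2],
    front: AtomicUsize,
}

impl<T> DoubleBuffer<T> {
    fn new(a: T, b: T) -> Self {
        Self { buffers: [a, b], front: AtomicUsize::new(0) }
    }
    /// Readers (REST/SSE consumers) see the published front buffer.
    fn read_front(&self) -> &T {
        &self.buffers[self.front.load(Ordering::Acquire)]
    }
    /// The shader cycle writes into the other buffer.
    fn back_index(&self) -> usize {
        self.front.load(Ordering::Acquire) ^ 1
    }
    /// Publish the back buffer by flipping the front index atomically.
    fn swap(&self) {
        self.front.fetch_xor(1, Ordering::AcqRel);
    }
}
```

The XOR trick means publish is a single atomic RMW with no compare-exchange loop.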
Dispatch (transparent):
AVX-512: F32x16 = __m512, mul_add → _mm512_fmadd_ps
AVX2: F32x8 = __m256, mul_add → _mm256_fmadd_ps
AMX: same F32x16 surface, tile-backed for matmul-heavy paths
NEON: F32x4 = float32x4_t, mul_add → vfmaq_f32
scalar: f32::mul_add loop fallback
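On the scalar fallback tier the integration step reduces to f32::mul_add per element. A sketch (function name illustrative; the SIMD tiers run the same arithmetic per lane):

```rust
/// Scalar-fallback velocity integration: p' = v.mul_add(dt, p), then
/// damping on v. On x86/ARM the same mul_add shape maps to the fused
/// multiply-add intrinsics listed above.
fn integrate_scalar(positions: &mut [f32], velocities: &mut [f32], dt: f32, damping: f32) {
    for (p, v) in positions.iter_mut().zip(velocities.iter_mut()) {
        *p = v.mul_add(dt, *p); // fused: p + v * dt in one rounding
        *v *= damping;
    }
}
```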
Tests: 11 new renderer tests; 1630 ndarray lib tests pass total
(previous 1619 + 11). Zero regressions.
Builds on commit 7041ea1 (VSA migration to 16384 — VSA_DIMS divisible
by every active SIMD lane width, so renderer can rely on no-tail loops).
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
…ve FPS
Enhancements over the initial renderer (commit 01f4ecd):
1. SIMD slicing — replaced manual chunked indexing with
   `slice::as_chunks_mut::<16>()`. Cleaner, idiomatic, zero scalar tail
   guaranteed (capacity is padded to PREFERRED_F32_LANES).
2. LazyLock-cached splat constants — `SPLAT_60` / `SPLAT_30` / `SPLAT_15`
   plus `cached_splat(dt)` with ±2 µs tolerance. Avoids re-splatting in the
   hot path for the 99% case where dt matches a canonical rate.
3. Viewport + foveated rendering — `Viewport { center, foveal_radius,
   peripheral_radius, cull_radius }`, an `UpdatePriority` enum, and
   `classify_priorities()` / `integrate_foveated()`. Off-screen nodes are
   skipped at chunk granularity; peripheral nodes update every 2nd tick,
   distant nodes every 4th. With a typical foveal-only share of 20%, that is
   a 5× speedup vs full integration.
4. FpsController — adaptive 60→30→15 with hysteresis. A single overrun steps
   down; 60 consecutive under-budget ticks step back up. An EWMA (α = 1/8)
   tracks the rolling mean tick duration. Auto-tunes under load without
   manual rate selection.
5. Renderer::tick_adaptive(&fps, damping) — recommended top-level entry.
   Renderer::tick_foveated(&fps, damping, viewport) — viewport-aware tick.
Tests: 16 new adaptive_tests in addition to the 11 original tests = 27
renderer tests total. All pass. Full ndarray suite: still clean (1646 lib
tests).
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
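The FpsController hysteresis can be sketched as follows. The rates, the single-overrun step-down, the 60-tick step-up, and the α = 1/8 EWMA follow the commit text; the struct layout and names are illustrative:

```rust
/// Illustrative sketch of adaptive 60→30→15 FPS with hysteresis.
struct FpsSketch {
    rates: [u32; 3],   // 60 / 30 / 15
    idx: usize,        // index of the current rate
    under_budget: u32, // consecutive under-budget ticks
    ewma_us: f32,      // rolling mean tick duration, alpha = 1/8
}

impl FpsSketch {
    fn new() -> Self {
        Self { rates: [60, 30, 15], idx: 0, under_budget: 0, ewma_us: 0.0 }
    }
    fn rate(&self) -> u32 { self.rates[self.idx] }
    fn record_tick(&mut self, tick_us: f32) {
        self.ewma_us += (tick_us - self.ewma_us) / 8.0; // EWMA, alpha = 1/8
        let budget_us = 1_000_000.0 / self.rate() as f32;
        if tick_us > budget_us {
            // A single overrun steps down immediately.
            self.idx = (self.idx + 1).min(self.rates.len() - 1);
            self.under_budget = 0;
        } else {
            self.under_budget += 1;
            // 60 consecutive under-budget ticks step back up.
            if self.under_budget >= 60 && self.idx > 0 {
                self.idx -= 1;
                self.under_budget = 0;
            }
        }
    }
}
```

The asymmetry (instant down, slow up) is what prevents oscillation at the budget boundary.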
… tier-adaptive fidelity
ndarray IS the graphics card. Tier-adaptive palette where the detected
SIMD tier drives visual fidelity:
AVX-512/AMX → 16 colors, 4 bpp, 8×8 sprites (512 KB wire @ 1024²)
AVX2 → 8 colors, 3 bpp, 6×6 sprites (384 KB wire)
NEON/scalar → 4 colors, 2 bpp, 4×4 sprites (256 KB wire)
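Assuming PaletteTier::detect() keys off PREFERRED_F32_LANES as stated below the table (16 lanes on AVX-512, 8 on AVX2, 4 on NEON/scalar), the mapping might look like this sketch (field names illustrative):

```rust
/// Illustrative tier descriptor matching the table above.
#[derive(Debug, PartialEq)]
struct PaletteTier {
    colors: u8, // palette entries
    bpp: u8,    // bits per packed pixel index
    sprite: u8, // sprite edge length in pixels
}

fn detect_tier(preferred_f32_lanes: usize) -> PaletteTier {
    match preferred_f32_lanes {
        16 => PaletteTier { colors: 16, bpp: 4, sprite: 8 }, // AVX-512 / AMX
        8  => PaletteTier { colors: 8,  bpp: 3, sprite: 6 }, // AVX2
        _  => PaletteTier { colors: 4,  bpp: 2, sprite: 4 }, // NEON / scalar
    }
}
```

The wire sizes in the table follow directly: 1024² pixels × bpp / 8, e.g. 4 bpp gives 512 KB.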
Uses the existing Pumpkin/Minecraft-derived primitives:
- palette_codec.rs for variable-width index packing (pack/unpack roundtrip)
- nibble.rs ready for 4-bit packed density fields
- byte_scan.rs for hit-testing
- U8x64::cmpeq_mask / shr_epi16 for SIMD nibble extract
Three views:
- MRI — density heatmap (blit_mri_density, palette = intensity)
- Neo4j — dot sprites at nodes + Bresenham edges (compose_neo4j)
- Cloud — mipmap LOD pyramid (build_mipmap_pyramid, downsample_2x)
Surface:
- Framebuffer { pixels, tier, dirty rect } + Bresenham draw_line + plot_dot
- PaletteTier::detect() from PREFERRED_F32_LANES
- compose_neo4j(fb, frame, edges, scale, offset, colors)
- compose_mri(fb, frame, scale, offset)
- build_mipmap_pyramid(fb, min_dim) → LOD chain
- fb.pack() → palette_codec compressed wire format
Mipmap LOD chain maps to the pyramid-cache hierarchy (EPIPHANIES.md):
L0 (1024²) = 1 MB → L2 cache
L1 (256²) = 64 KB → L1 cache
L2 (64²) = 4 KB → L0/registers
L3 (16²) = 256 B → inline
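A grayscale sketch of the downsample_2x step that builds each LOD level (the in-tree version works on packed palette indices and may round differently):

```rust
/// 2x box-filter downsample: each output pixel is the rounded mean of a
/// 2x2 source block. Halving both dimensions quarters the byte footprint,
/// which is what walks the pyramid down the cache hierarchy.
fn downsample_2x(src: &[u8], w: usize, h: usize) -> Vec<u8> {
    let (ow, oh) = (w / 2, h / 2);
    let mut out = vec![0u8; ow * oh];
    for y in 0..oh {
        for x in 0..ow {
            // Sum the 2x2 block; +2 rounds to nearest on the /4.
            let s = src[2 * y * w + 2 * x] as u32
                + src[2 * y * w + 2 * x + 1] as u32
                + src[(2 * y + 1) * w + 2 * x] as u32
                + src[(2 * y + 1) * w + 2 * x + 1] as u32;
            out[y * ow + x] = ((s + 2) / 4) as u8;
        }
    }
    out
}
```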
Tests: 16 new framebuffer tests. All pass. Full suite: 1698 lib tests.
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
…yby ring
Demoscene-inspired visual enhancements for the Minecraft-style renderer:
1. WobbleState — spring displacement perpendicular to the velocity direction,
   with exponential decay (0.92/tick). Injected on high-velocity nodes. Masks
   layout jitter and makes the graph feel alive. Deterministic (no RNG).
2. FireState — per-node [0, 255] intensity. The shader fires on Commit (255) /
   Epiphany (200) / FailureTicket (128) and decays 16/tick. Maps to a palette
   color boost (additive blend clamped to the palette max).
3. GLYPH_ATLAS — 5×7 bitmap font covering A–Z, 0–9, and punctuation.
   128 entries × 5 bytes = 640 bytes total, so it fits in L1. Column-major
   for efficient vertical scanline blits. draw_label() renders at any (x, y).
4. FlybyCache — Amiga-style pre-rendered ring buffer. A Lissajous satellite
   orbit (figure-8, seamless loop) is pre-rendered as N palette_codec-packed
   keyframes. next_frame() loops; seek_nearest() snaps to the closest
   keyframe on re-entry from interactive mode. 300 frames × 512 KB
   (16-color 1024²) = 150 MB; 300 frames × 128 KB (512²) = 38 MB.
5. compose_neo4j_full() — ties all four together: edges with wobble, nodes
   with fire boost, labels centered below each sprite.
Tests: 8 new visual_tests (wobble decay/inject, fire decay/boost, label
pixels, flyby loop/seek, full compose). 24 total framebuffer tests pass.
The module is now 1032 LOC.
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
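The FireState rules reduce to two tiny per-node transitions. A sketch using the commit's constants (event boosts 255/200/128, decay 16/tick); the function names are illustrative:

```rust
/// Event injection: an event can only raise intensity, never lower it.
/// Boost values from the commit: Commit=255, Epiphany=200, FailureTicket=128.
fn fire_inject(intensity: u8, event_boost: u8) -> u8 {
    intensity.max(event_boost)
}

/// Per-tick decay: fixed 16/tick, clamped at zero via saturating_sub.
fn fire_decay(intensity: u8) -> u8 {
    intensity.saturating_sub(16)
}
```

Saturating arithmetic keeps both transitions branch-free, which matters once they run per node per tick.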
…aligned levels
The inverse Stufenpyramide IS a GPU shader pipeline, made visible:
L1 (64²)   →   4 KB → registers/L0   ← inject here
L2 (256²)  →  64 KB → L1 data cache  ← cascade up
L3 (1024²) →   1 MB → L2 cache       ← cascade up
L4 (2048²) →   4 MB → L3 cache       ← output surface
PyramidShader::inject(x, y, intensity) drops heat at L1.
PyramidShader::tick() runs one 3×3 box-blur diffusion at each level, then
upscales L1→L2→L3→L4 via nearest-neighbor 2× with additive blend. A global
decay on L4 prevents saturation. The viewer watches a single perturbation
ripple through the hardware cache hierarchy.
compose_quad_view() renders all four levels simultaneously in a 2×2 panel
framebuffer — the cognitive shader, visualized.
Also: diffuse_step (3×3 box blur), upscale_2x, blit_scaled.
Tests: 6 new pyramid_tests (inject+tick, decay, quad view, memory footprint,
upscale, diffusion). 30 total framebuffer tests. The module is now 1303 LOC.
Total this session: 1303 LOC framebuffer (tier-adaptive palette,
MRI/Neo4j/Cloud views, wobble, fire, glyphs, Amiga flyby, pyramid shader)
+ 766 LOC renderer (double-buffer, SIMD FMA, foveated, adaptive FPS) =
2069 LOC total rendering pipeline. 57 tests pass.
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
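A sketch of the per-level 3×3 box-blur diffusion step, assuming clamped borders (the in-tree diffuse_step may handle edges differently):

```rust
/// One diffusion step: every cell becomes the mean of its 3x3
/// neighborhood, with neighbors clamped at the borders. Heat spreads by
/// one cell per tick; total heat in the interior is preserved.
fn diffuse_step(buf: &[f32], w: usize, h: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; w * h];
    for y in 0..h as isize {
        for x in 0..w as isize {
            let mut sum = 0.0;
            for dy in -1..=1 {
                for dx in -1..=1 {
                    // Clamp out-of-bounds neighbors to the nearest edge cell.
                    let ny = (y + dy).clamp(0, h as isize - 1) as usize;
                    let nx = (x + dx).clamp(0, w as isize - 1) as usize;
                    sum += buf[ny * w + nx];
                }
            }
            out[y as usize * w + x as usize] = sum / 9.0;
        }
    }
    out
}
```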
Per the seismon session wishlist — 8 new methods on U8x64 across all three
SIMD backends (AVX-512 native / AVX2 scalar / scalar fallback):
Tier 1 (rasterizer core):
  pairwise_avg   → _mm512_avg_epu8        — mipmap 4×4 downsample in 2 ops
  cmpgt_mask     → _mm512_cmpgt_epu8_mask — threshold/Z-test/hit-test mask
  mask_blend     → _mm512_mask_blend_epi8 — sprite alpha blit
  shl_epi16      → _mm512_slli_epi16      — nibble write (completes the shr pair)
Tier 2 (sprite blit + palette):
  mask_store     → _mm512_mask_storeu_epi8 — partial-tile edge writes
  saturating_add → _mm512_adds_epu8        — additive blend (completes the sub pair)
  permute_bytes  → _mm512_permutexvar_epi8 — cross-lane byte shuffle
All methods have matching scalar fallbacks in simd.rs and simd_avx2.rs for
NEON/non-AVX-512 targets. Consumers write crate::simd::U8x64 — the polyfill
picks the path.
Tests: 9 new u8x64_rasterizer_tests (pairwise_avg ×2, cmpgt_mask, mask_blend,
shl_epi16, saturating_add ×2, permute_bytes ×2). All pass.
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh
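Two of the scalar fallbacks are worth pinning down, because the x86 intrinsics have specific rounding and saturation semantics: _mm512_avg_epu8 rounds up, and _mm512_adds_epu8 saturates at 255. A single-byte reference sketch (the real methods apply this per lane across 64 bytes):

```rust
/// Scalar semantics of _mm512_avg_epu8: (a + b + 1) >> 1, i.e. the
/// average rounded *up* — the detail that matters for mipmap downsampling.
fn avg_epu8(a: u8, b: u8) -> u8 {
    ((a as u16 + b as u16 + 1) >> 1) as u8
}

/// Scalar semantics of _mm512_adds_epu8: unsigned add, clamped at 255,
/// which is exactly what an additive blend needs.
fn adds_epu8(a: u8, b: u8) -> u8 {
    a.saturating_add(b)
}
```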
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1f224baee8
```rust
let mut back = self.write_back();
let RenderFrame { positions, velocities, tick, .. } = &mut *back;
integrate_simd(positions, velocities, dt, damping);
*tick = self.tick_count.load(Ordering::Acquire) + 1;
```
Seed back buffer from front before advancing tick
tick() integrates the current back frame in place but never copies state from the current front frame first. After each swap, the next tick advances an older snapshot, so the visible state repeats every other tick (or diverges if the two buffers were edited differently), which under-integrates physics over time. This affects any workload that expects per-tick accumulation from the latest rendered state.
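The fix the review points at is to seed the back buffer from the current front snapshot before integrating, so every tick advances the latest published state. A shape sketch on plain slices (the real frame copy would cover all the SoA arrays, not just one):

```rust
/// Copy the published front state into the back buffer, then run the
/// integration step in place. The swap that publishes `back` afterwards
/// is left to the caller.
fn seeded_tick(front: &[f32], back: &mut [f32], step: impl Fn(&mut [f32])) {
    back.copy_from_slice(front); // seed from the latest snapshot first
    step(back);                  // then integrate in place
}
```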
```rust
let (p_chunks, p_tail) = positions.as_chunks_mut::<16>();
let (v_chunks, v_tail) = velocities.as_chunks_mut::<16>();
debug_assert!(p_tail.is_empty() && v_tail.is_empty());
```
Align integration chunking with non-AVX512 lane settings
This path hard-codes 16-float chunks, but frame allocation is padded with PREFERRED_F32_LANES (8 on AVX2, 4 on NEON). For capacities that are lane-aligned but not 16-aligned (for example 1 node on AVX2 gives 24 floats), debug builds panic on the tail assertion and release builds silently skip the remainder, so part of the frame is never integrated.
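One way to address this, as the review implies, is to chunk by PREFERRED_F32_LANES instead of a hard-coded 16. A sketch — the constant name comes from the PR, but its value here is a stand-in for the AVX2 tier:

```rust
/// Stand-in for the crate's per-target constant (8 on AVX2; 16 on
/// AVX-512, 4 on NEON in the PR's description).
const PREFERRED_F32_LANES: usize = 8;

/// Integrate in lane-width chunks so any lane-aligned capacity divides
/// evenly, rather than asserting 16-alignment.
fn integrate_lane_chunks(positions: &mut [f32], velocities: &[f32], dt: f32) {
    debug_assert_eq!(positions.len() % PREFERRED_F32_LANES, 0);
    for (p_chunk, v_chunk) in positions
        .chunks_exact_mut(PREFERRED_F32_LANES)
        .zip(velocities.chunks_exact(PREFERRED_F32_LANES))
    {
        for (p, v) in p_chunk.iter_mut().zip(v_chunk) {
            *p = v.mul_add(dt, *p);
        }
    }
}
```

chunks_exact_mut also makes the no-remainder property explicit instead of relying on a debug-only assertion.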
```rust
8 => _mm512_slli_epi16(self.0, 8),
_ => _mm512_setzero_si512(),
```
Support full 0..15 shifts in AVX-512 shl_epi16
The AVX-512 implementation returns zero for every shift not in 1..=8, while the scalar and AVX2 backends handle any shift <16. That creates backend-dependent behavior for imm=0 and imm=9..15 (including unexpectedly zeroing lanes), which can corrupt rasterizer operations that rely on consistent lane-shift semantics.
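A scalar reference for the semantics all three backends should agree on across the full immediate range: imm = 0 is the identity, and imm in 9..=15 still shifts within the 16-bit lane rather than zeroing it.

```rust
/// Per-lane reference for shl_epi16: shift a 16-bit lane left by any
/// imm < 16. Bits shifted past bit 15 are discarded, matching
/// _mm512_slli_epi16 for in-range immediates.
fn shl_epi16_scalar(lane: u16, imm: u32) -> u16 {
    debug_assert!(imm < 16);
    lane << imm
}
```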
Summary
- vsa: `[u64; 157]` / 10000-bit → `[u64; 256]` / 16384-bit (Binary16K).
  SIMD-clean at every precision tier.
- hpc::renderer: SIMD double-buffer for SPO graph rendering. `RenderFrame`
  SoA + `Renderer` with atomic XOR swap + `tick()` FMA integration. Foveated
  rendering, adaptive FPS, LazyLock-cached splat constants.
- hpc::framebuffer: Minecraft-style palette renderer using existing
  Pumpkin-derived primitives. Tier-adaptive palette (AVX-512 = 16, AVX2 = 8,
  NEON = 4 colors). MRI / Neo4j / Cloud views. Wobble, neuron fire, glyph
  atlas, Amiga flyby ring buffer, pyramid shader.
- U8x64 rasterizer intrinsics: `pairwise_avg`, `cmpgt_mask`, `mask_blend`,
  `shl_epi16`, `mask_store`, `saturating_add`, `permute_bytes`. All three
  backends.
Test plan
- `cargo check` clean
https://claude.ai/code/session_01SbYsmmbPf9YQuYbHZN52Zh